Statistical mechanics of RNA folding: importance of alphabet size 
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We construct a base-stacking model of RNA secondary-structure formation and use it to study 
the mapping from sequence to structure. There are strong, quaUtative differences between two- 
letter and four or six-letter alphabets. With only two kinds of bases, most sequences have many 
alternative folding configurations and are consequently thermally unstable. Stable ground states 
are found only for a small set of structures of high designability, i.e. total number of associated 
sequences. In contrast, sequences made from four bases, as found in nature, or six bases have far 
fewer competing folding configurations, resulting in a much greater average stability of the ground 
state. 



I. INTRODUCTION 

RNA plays a central role in molecular biology. In ad- 
dition to transmitting genetic information from DNA 
to proteins, RNA molecules participate actively in a 
variety of cellular processes T^. Examples are found 
in translation (rRNA, tRNA, and tmRNA), editing of 
mRNA, intracellular protein targeting, nuclear splicing of 
pre-mRNA, and X-cliromosome inactivation. The RNA 
molecules involved in these processes do not code for pro- 
teins but act as functional products in their own right. 
In addition, RNA molecules prepared in vitro can be se- 
lected to bind to specific molecules such as ATP In 
all these cases, the information encoded in the sequence 
of nucleotide bases of each RNA molecule determines its 
functional three-dimensional structure. The nucleotide 
sequence is a kind of genotype, i.e., hereditary informa- 
tion, while the folded three-dimensional structure rep- 
resents phenotype, the physical characteristics on which 
natural selection operates. The mapping from genotype 
to phenotype bears on how biological systems evolve, and 
RNA folding probably constitutes the simplest example 
of this mapping j3[. Since early life is believed to have 
been RNA based RNA folding can provide us with 
important clues about early life and evolution. 

RNA is a polynucleotide chain consisting of the four 
bases: A, U, G, and C. Complementary base pairs (A-U 
and G-C) can stack to form "stems" which are helical seg- 
ments similar to the double helix of DNA. These helices, 
called secondary structures, are generally arranged in 
a three-dimensional tertiary structure, stabilized by the 
much weaker interactions between the helices. Represen- 
tations of secondary structures are shown in Fig. 1. The 
energy contributions of secondary and tertiary structures 
are hierarchical^, with secondary structures largely de- 
termining tertiary folding. Secondary structure is fre- 
quently conserved in evolution, and structural homology 
has been used successfully to predict function 0|. 

In this paper, we investigate the role of alphabet size in 



the statistical mechanics and selection of RNA secondary 
structures. We find pronounced differences between two- 
letter and four or six-letter alphabets. For sequences con- 
structed with two types of bases, only a small fraction of 
sequences have thermodynamically stable ground-state 
structures; these structures are also highly designable, 
i.e., have a large number of associated sequences. Four 
and six-letter sequences are much more stable on average, 
but exhibit no strong correlation between designability 
and thermodynamic stability. We trace this difference to 
the greater likelihood of competing, alternatively paired 
configurations when a two-letter alphabet is used. 

For RNA, there alreadyexist algorithms that predict 
secondary structures 0, Q . These algorithms are in- 
tended to apply to real RNA and, consequently, involve 
a large number of parameters for the different pairing 
and stacking combinations. Using one of these algo- 
rithms, Fontana et al. found a broad distribution of 
designabilities, i.e. number of sequences per structure, 
after structures were grouped by topology. In this pa- 
per, we present, instead, a much simpler model for RNA 
secondary structure designed to elucidate the role of al- 
phabet size. 

The organization of this paper is as follows. In section 
II, we present a base-stacking model for RNA secondary 
structure and outline the recursive algorithm used to 
compute the partition function and ground-state struc- 
ture. In section III, we employ our model to analyze 
the stability of folded structures. We find a significant 
difference in stability between two-letter and four or six- 
letter sequences due to the greater likelihood of alter- 
native folds in the two-letter case. As a consequence 
of these alternative folds, in the two-letter case, stabil- 
ity correlates with designability, i.e., total number of se- 
quences associated with a structure. In addition, we find 
that RNA sequences folding to a given structure form a 
percolating neutral network. Finally, in section IV, we 
summarize our main conclusions. 



II. BASE-STACKING MODEL 
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We introduce a base-stacking model for RNA 
secondary-structure formation. It is known that, within 
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a stem of base pairs, the largest energy contribution is the 
stacking energy between two adjacent base pairs (rather 
than the base-pairing energy itself) and the total energy 
of the stem is the sum of stacking energies over all ad- 
jacent base pairs 0. A single stack -I- 1; j — 1, j) is 
defined as two adjacent non-overlapping base pairs 
and {i + l, j ~ 1) where i + 1 < j — 1. For this stack 
-I- l;j — we assign an energy —Eg if and 
{i + — 1) are both complementary Watson-Crick base 
pairs and zero otherwise. We thus neglect differences in 
energy between, for example, (A,A;U,U), (A,G;C,U) and 
(G,G;C,C) stacks. We also neglect energy contributions 
from isolated base pairs that are not part of a stack, and, 
consequently, do not include isolated base pairs in the 
secondary structure. 

The largest entropic contribution to an RNA structure 
comes from stretches of unpaired bases. We incorporate a 
simplified version of this polymer configurational entropy 
in our model by associating a degrees of freedom with 
every unpaired base. Thus, the restricted partition func- 
tion, corresponding to all micro-states compatible with a 
given secondary structure is 



a " exp 



(2.1) 



where n„ is the number of unpaired bases, is the num- 
ber of stacks, and T is the temperature. The restricted 
free energy is F^icro = -kBTln Z„^icio = -EsTIs - 
ksTriu In a. 

In this model, since only complementary base pairs can 
participate in a stack, only a fraction of possible struc- 
tures are compatible with any given sequence. However, 
provided the structure is compatible with the sequence, 
its restricted free energy is independent of the sequence. 

The change in free energy due to the formation of an 
isolated stack is —Eg + AkBTlna; the first term cor- 
responds to the stacking energy and the second to the 
loss in configurational entropy (since four bases par- 
ticipate in the stack). For every additional adjacent 
stack the change in free energy is —Eg + 2kBT\na, 
since only two bases are added to the stack. If, for 
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example, Eg < AkBTlna but 2Es > GfesTlna {i.e., 
•iksTlna < Eg < 4A:BTlna), then formation of an iso- 
lated stack would be unfavorable but formation of a seg- 
ment consisting of two or more adjacent stacks would 
be favored by a net decrease in free energy. Thus, for 
an appropriate choice of parameters, the model correctly 
provides a nucleation cost to the formation of stems. For 
this paper we choose Ina = 1.5 and Eg = b.bksT, which 
are physically motivated and correspond to a nucleation 
cost for the formation of an isolated stack, with a min- 
imum of two adjacent stacks required to form a stable 
stem. Our results, however, do not depend sensitively on 
the choice of these parameters. 

In the secondary structure, any two base pairs {ii,ji) 
and (^2, j2), with ii < 12, are either nested {ii < i2 < 
j2 < ji) or independent («i < ji < 12 < j2)- Other 
possibilities correspond to "pseudoknots" , which are en- 
ergetically and kinetically suppressed. It is customary to 
regard pseudoknots as part of the tertiary structure, and 
we do not include them here. 

In order to compute the ground-state structure and 
partition function for a given sequence, we make use of 
the hierarchical nature of secondary structures (due to 
the absence of pseudoknots). We use a recursive algo- 
rithm that is a generalization of the techniques described 
in Refs. 9] and ^IQ] . Consider the partition function 
Zij for a segment of bases from the position i to j > i. 
The base j is either unpaired or can be part of a stack 
{k,k + l;j — l,j) with k G {i, . . . ,j — 3}. Thus Zij obeys: 
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■Pg{k,k+l;j-l,j)-Zk+2;j-2], (2.2) 

where Vg{k,k -I- 1; j — l,j) equals 1 if both (fc,j) and 
{k+l,j — \) are complementary base pairs, and equals 
otherwise; Zi^i^i is defined to equal 1. We have intro- 
duced Zi,j which is the partition function for the segment 
with the boundary condition that sites i — 1 and j + 1 are 
paired, implying an energy —Eg for the formation of a 
bond between the bases at sites i and j. We thus require 
a second recursion relation for Z: 
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Vb{i,j) ■ Zi+ij-i 



(2.3) 



aZij^i + e 

E [Z,,fc_ie^=/'=-^7'3(A;,fc-f l;j - l,j) • Zk+2;j-2], 



FIG. 1: Representations of RNA secondary structures: (a) 
flattened-helix diagram of a 16-base structure, and (b) rain- 
bow diagram of the same structure. The restriction that 
arches do not cross in the rainbow diagram implies the ab- 
sence of pseudoknots. 



where Vb{i,j) equals 1 if («,j) are complementary base 
pairs and otherwise. The partition function Zi^n can be 
computed recursively using (2) and (3) in 0{N^) steps. 
We use a similar recursive algorithm to compute the 
ground-state structure Si^n. 
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III. RESULTS 
A. Dependence on Alphabet Size 

We have employed our model to analyze the stabil- 
ity of folded structures corresponding to two, four, and 
six-letter sequences. The thermodynamic stability is de- 
fined as the probability Pqs that the sequence will be 
found in the ground state, Pqs = ^-Fg^I^bT ji^ where 
Fqs is the free energy associated with the ground state. 
Fig. 2 shows a histogram of stability for 40-nucleotide 
long sequences with ground states containing 12 to 15 
stacks |ll|. We find four- letter sequences considerably 
more stable on average than two-letter sequences. 

What is the origin of the difference in stabilities be- 
tween two-letter and four-letter RNA sequences? In or- 
der to address this question, we classify the excited-state 
structures as (i) those formed by breaking existing pairs, 
and (ii) those formed by re-pairing, i.e., by forming new 
pairs in addition to breaking existing pairs. Independent 
of alphabet size, all sequences folding into a given sec- 
ondary structure S have the same set of "pair-breaking" 
excited states. The sequence dependence of stability 
for a given ground-state structure results entirely from 
re-pairings. The crucial difference between two-letter 
and four sequences lies in the substantially greater likeli- 
hood of "re-paired" excited states for two-letter sequences. 
This follows because the number of pairs one can form in 
a random sequence of two letters is typically much larger 
than for a four or six-letter sequence of the same length. 
For example, for a random four-letter sequence of length 
iV, the probability of forming a stem involving sites i to 
i-\-l and j — I to j is lower by a factor of 2' as compared to 
a random two-letter sequence of the same length. For the 
same reason, the fraction of sequences that have highly 
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FIG. 2: Histogram of stabilities for: (a) 40-nucleotide long 
two-letter sequences, and (b) 40-nucleotide long four-letter 
sequences. 



stacked ground states is much greater for two-letter se- 
quences than for four-letter sequences, and much greater 
for four than for six. 

To demonstrate the importance of "re-paired' ex- 
cited states, we first calculate a "pair-breaking" stabil- 
ity -Pmax — e~^°^/*''^"^/Z where Z is a pair-breaking 
partition function calculated by considering only pair- 
broken excited states. Pmax gives us an upper bound 
to the true stability, i.e., probability in ground state 
PgSj that includes competition from re-paired states. In 
Fig. 3, we plot the true average stabihty (Pes) against 
the pair-breaking stability Pmax for two, four, and six- 
letter sequences. As expected, the average stability is 
much closer to the maximum set by pair breaking in the 
case of four-letter sequences than in the case of two- letter 
sequences. Thus, structures constructed with four-letter 
sequences are typically much more than stable than those 
constructed with two letters, and six-letter sequences are 
typically more stable than four-letter ones. For folding 
kinetics, it is these same "re-paired" states that act as 
kinetic traps. Due to the lower likelihood of such states, 
we expect four and six-letter sequences to typically fold 
faster than two-letter sequences. 



B. Stability and Designability 

What determines the average stability, (Pes), of two- 
letter sequences? We have seen that for four and six- 
letter sequences the average stability is close to the "pair- 
breaking" stability which is determined largely by the 
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FIG. 3: Average of actual probability in ground state vs. up- 
per bound Pmax allowing only for breaking of base pairs, plot- 
ted for 40-nucleotide sequences, -l-'s denote RNA sequences 
constructed from two types of bases, o's denote those con- 
structed from four, and x's denote sequences constructed 
from six types of bases. The actual probability is averaged 
over sequences with the same pair-breaking stability Pmax 
(which is sequence independent). 
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number of stems and loops. Insight into the stability 
of two-letter RNA sequences comes from results in pro- 
tein folding [11 El El El El 113 ■ Based on solvation 
models with differing hydrophobicities of amino acids, a 
principle of designability has emerged for protein folding. 
The designability of a structure is measured by the num- 
ber of sequences folding uniquely into that structure. A 
small class of protein stuctures emerge as being highly 
designable; remarkably, the same class of structures are 
highly d esig nable whether two or all 20 amino-acid types 
are used[l3- In a wide range of protein models, se- 
quences associated with highly desig nable structures are 
thermodynamically more stable[il E3 ^.nd fold faster 
than typical sequences El- This connection between the 
designability of a structure and the stability of its asso- 
ciated sequences is referred to as the designability prin- 
ciple. The designability principle reflects a competition 
among structures. In solvation models, sequences will 
fold to structures which best match their hydrophobic 
amino acids to buried sites in the structure (shielded 
from water). Highly designable structures are those with 
unusual patterns of surface exposure, and therefore few 
competitors. This lack of competitors also implies that 
the sequences folding to such structures are thermally 
stable. We will now show that the designability principle 
also holds for two-letter RNA. 
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is more noisy for RNA than it is for proteins; so we plot 
the integrated distribution of designabilities. The most 
designable structure consists of a stem with a hairpin 
loop, and a dangling end. We have also studied longer 
sequences, of lengths 40 and 50, for which we sample 
sequence space. For 40-nucleotide sequences, the most 
designable structures consist of a single hairpin loop and 
dangling ends; a number of double hairpin structures 
are also highly designable (Fig. 5). For sequences of 
length 50, double hairpin structures emerge as the most 
designable. Finally, we find a pronounced correlation 
between designability and stability of RNA structures. 
This is shown in Fig. 4 |21| for 24-nucleotide sequences. 
Thus, two-lette RNA sequences which fold into highly 
designable secondary structures are unusually thermally 
stable, verifying the designability principle. 

In contrast, for four-letter sequences the range of des- 
ignabilities is narrower and there is only a weak correla- 
tion between designability and stability, with highly sta- 
ble sequences existing for structures of both high and 
low designability (Fig. 6). The results for six letters are 




FIG. 5: A few highly designable structures for 40-nucleotide 
long two-letter RNA sequences. 
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FIG. 4: Stability [20] versus designability TVs (in logarithmic 
scale) for 24-base RNA sequences constructed with two types 
of bases. In the inset we plot fraction of compact structures 
[19] with designability above Ns versus Ns for two and four- 
letter RNA sequences. 
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For two base- types (say A and U), we enumerate all 
sequences and structures of length 24. We find that sec- 
ondary structures differ considerably in their designabil- 
ity; there are highly designable structures which are 
ground states of a large number of sequences, and there 
are poorly designable structures which are ground states 
of only a few sequences (cf. Fig. 4 inset). In this re- 
spect, the results for two-letter sequences are similar to 
those for protein modelsflsLll^. However, the histogram 
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FIG. 6: Stability [20] versus designability Ns (in logarithmic 
scale) for 24-base RNA sequences constructed with four types 
of bases. We find no significant correlation between designi- 
bafity and stability for four letters. 
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FIG. 7: For some given 24-nucleotide two letter sequence ai 
we plot (a) a histogram of the distances to all two-letter se- 
quences with the same ground state structure, and (b) a his- 
togram of the distances to all two-letter sequences. Histogram 
(b) is independent of the choice of ai. Histogram (a) is also 
roughly independent of sequence ai provided its ground-state 
structure is highly designable. 



similar. We trace this difference between two and four 
or six-letter sequences to the likelihood of competing re- 
paired states. For two letters, the correlation between 
designabihty and stability (as well as the nontrivial dis- 
tribution of designabilities) arises primarily from compet- 
ing re-paired states. Four and six-letter sequences have 
far fewer competing re-paired states and hence do not 
demonstrate significant correlation between designabihty 
and stability. 

C. Neutral Networks 

Finally we consider the "neutral network" of RNA se- 
quences which fold to a particular structure. The con- 
nectivity within a network and the shortest distance be- 
tween networks has drawn considerable attention with 
respect to the evolvability of RNA structureS[St. i22j. In 
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our model, the network of sequences which fold to a par- 
ticular structure is truly "neutral" in that all sequences 
have the same ground-state free energy Fgs, albeit with 
different stabilities because of repairing. (This contrasts 
with protein solvation models in which, independent of 
competing structures, there is typically an energy hi- 
erarchy of sequences for each structure, determined by 
the match between hydrophobicity and surface-exposure 
pattern^3|-) In our model, RNA sequences that fold 
to a given structure form, in general, a percolating and 
non-compact network in sequence space. In particular, a 
histogram of the distances between sequences folding to 
the same highly designable structure is actually broader 
than a histogram of the distances between all sequences 
(Fig. 7)'23'|. In this respect, the RNA model differs con- 
siderably from protein models. 



IV. CONCLUSIONS 

To conclude, in this paper we developed and studied a 
minimalist base-stacking model of RNA secondary struc- 
ture. We found that sequences constructed with four or 
six types of bases typically have fewer competing excited 
states, and, consequently, have greater ground-state sta- 
bility, compared to sequences constructed with two types 
-of bases. At the same time, the fraction of sequences 
with highly stacked ground states is much smaller for 
four-letter sequences than for two, and much smaller for 
six letters than for four. It is tempting to speculate that 
four letters optimizes the stability of structures while 
maintaining a reasonable probability that a random se- 
quence folds into a highly stacked structure. If, as has 
been postulated, early life was indeed RNA based and 
double-stranded DNA came later in evolution, our ob- 
servations might plausibly bear on nature's choice of four 
letters for the genetic code. 
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